
Metadata Retry: Adds Cross-Region Operation-level Retry for Metadata Request Failures#5780

Open
yash2710 wants to merge 2 commits into main from users/trivediyash/addressRefreshCrossRegion

Conversation

@yash2710
Contributor

Description

When a Cosmos DB region becomes unavailable, metadata requests (e.g., partition key range reads) fail with 503/20003 and are retried only within the same region until exhausted, with no cross-region failover. This change enables cross-region retry at both the metadata request level and the operation level.

Problem

MetadataRequestThrottleRetryPolicy retried metadata requests in the same region using a location index, but never marked the failing endpoint as unavailable in the LocationCache. If all retries within the policy were exhausted, the exception propagated without any signal to route future requests to a different region. Additionally, the CosmosException thrown from failed metadata requests was not caught in AbstractRetryHandler, preventing ClientRetryPolicy from evaluating it for cross-region failover. The Change Feed Processor's BootstrapperCore bypasses the handler pipeline entirely and previously had no retry logic for regional failures.

Changes

MetadataRequestThrottleRetryPolicy - On regional failures (503, 500, 410/LeaseNotFound, 403/DatabaseAccountNotFound, HttpRequestException, non-user OperationCanceledException):

  1. Marks the failing endpoint as unavailable via GlobalEndpointManager.MarkEndpointUnavailableForRead() so the LocationCache deprioritizes that region for future requests
  2. Retries on the next preferred region (existing IncrementRetryIndex logic, limited to PreferredLocationCount)
  3. After exhausting all regions, returns NoRetry so the exception propagates to the operation-level retry policy
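
In sketch form, the consolidated flow looks roughly like the following. This is illustrative, not the exact diff: the member names (HandleRegionalFailure, locationEndpoint, IncrementRetryIndexOnUnavailableEndpointForMetadataRead) come from the review comments below, while the field names, guard, and retry-result factory are assumptions.

// Sketch only: condenses the three steps above. Names not shown in this PR's excerpts
// (globalEndpointManager, the zero back-off) are assumed for illustration.
private ShouldRetryResult HandleRegionalFailure()
{
    // 1. Deprioritize the failing region for future reads (self-heals after the LocationCache TTL).
    if (this.locationEndpoint != null)
    {
        this.globalEndpointManager.MarkEndpointUnavailableForRead(this.locationEndpoint);
    }

    // 2. Retry on the next preferred region while any remain (bounded by PreferredLocationCount).
    if (this.IncrementRetryIndexOnUnavailableEndpointForMetadataRead())
    {
        return ShouldRetryResult.RetryAfter(TimeSpan.Zero);
    }

    // 3. All regions exhausted: surface the exception to the operation-level retry policy.
    return ShouldRetryResult.NoRetry();
}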

AbstractRetryHandler - Added catch (CosmosException) in ExecuteHttpRequestAsync so that metadata failures (which throw CosmosException from CosmosHttpClientCore) reach ClientRetryPolicy.ShouldRetryAsync(Exception) for operation-level cross-region retry.
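
Roughly, the new catch sits inside the existing send/retry loop in ExecuteHttpRequestAsync; the variable names below (callbackMethod, retryPolicy) are illustrative, not copied from the handler.

// Sketch only: shape of the added catch, inside the handler's existing retry loop.
try
{
    return await callbackMethod();
}
catch (CosmosException cosmosException)
{
    // Metadata failures from CosmosHttpClientCore surface as CosmosException; hand them to the
    // retry policy so ClientRetryPolicy can evaluate operation-level cross-region failover.
    ShouldRetryResult result = await retryPolicy.ShouldRetryAsync(cosmosException, cancellationToken);
    if (!result.ShouldRetry)
    {
        throw;
    }
}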

BootstrapperCore - Added retry logic for CosmosException and HttpRequestException in InitializeAsync() (up to MaxInitializationRetries = 3), since the Change Feed Processor bypasses the handler pipeline and has no ClientRetryPolicy coverage.
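
The retry shape in InitializeAsync() is roughly the following, condensed from the excerpts reviewed below; the loop structure and field names (retryCount, sleepTime) are assumptions, and the actual method also performs the lease-store acquire/initialize steps omitted here.

// Sketch only: bounded retry around the initialization steps; once the when guards
// stop matching, the exception propagates to the caller.
int retryCount = 0;
while (true)
{
    try
    {
        await this.synchronizer.CreateMissingLeasesAsync().ConfigureAwait(false);
        await this.leaseStore.MarkInitializedAsync().ConfigureAwait(false);
        break;
    }
    catch (CosmosException) when (retryCount < MaxInitializationRetries)
    {
        retryCount++;
        await Task.Delay(this.sleepTime).ConfigureAwait(false);
    }
    catch (HttpRequestException) when (retryCount < MaxInitializationRetries)
    {
        retryCount++;
        await Task.Delay(this.sleepTime).ConfigureAwait(false);
    }
}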

Testing

  • 10 unit tests for MetadataRequestThrottleRetryPolicy covering: first-attempt retry, retry exhaustion, regional failure status codes, HttpRequestException, non-user OperationCanceledException, user cancellation, and 429 throttling
  • 3 unit tests for BootstrapperCore covering: CosmosException retry + success, HttpRequestException retry + success, and retry exhaustion
  • All 12 existing RetryHandlerTests continue to pass

Type of change

Please delete options that are not relevant.

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] This change requires a documentation update

Closing issues

closes #5660

@azure-pipelines

Azure Pipelines:
Successfully started running 1 pipeline(s).

@github-actions bot left a comment

All good!

@yash2710 marked this pull request as ready for review on April 30, 2026 19:06
@azure-pipelines

Azure Pipelines:
Successfully started running 1 pipeline(s).

1 similar comment

@yash2710 changed the title from "Cross-Region Retry for Metadata Request Failures" to "Metadata Retry: Adds Cross-Region Retry for Metadata Request Failures" on Apr 30, 2026
@yash2710 changed the title from "Metadata Retry: Adds Cross-Region Retry for Metadata Request Failures" to "Metadata Retry: Adds Cross-Region Operation-level Retry for Metadata Request Failures" on Apr 30, 2026
@NaluTripician
Contributor

@sdkReviewAgent-2

await this.synchronizer.CreateMissingLeasesAsync().ConfigureAwait(false);
await this.leaseStore.MarkInitializedAsync().ConfigureAwait(false);
}
catch (CosmosException ex) when (retryCount < MaxInitializationRetries)
Member

🟡 Recommendation — Overbroad Exception Filter: Catches all CosmosException types including non-transient errors

The when guard only checks retryCount, but the XML doc and comment explicitly say this is for "regional error (e.g., CosmosException with 503 or HttpRequestException)". The code doesn't enforce this — a 400 BadRequest, 404 NotFound, or 409 Conflict will also be caught and retried 3 times with 15-second delays (up to 45 seconds wasted on a deterministic failure).

This is inconsistent with MetadataRequestThrottleRetryPolicy.ShouldRetryInternalAsync() in this same PR, which carefully discriminates by status code (503, 500, 410/LeaseNotFound, 403/DatabaseAccountNotFound).

Concrete failure scenario: If the lease container is deleted mid-initialization, CreateMissingLeasesAsync() throws a 404 CosmosException. This catch retries 3× against the same endpoint with the same result, adding ~45 seconds of latency before finally propagating.

Suggestion: Add a status code filter to align with the policy:

catch (CosmosException ex) when (retryCount < MaxInitializationRetries
    && (ex.StatusCode == HttpStatusCode.ServiceUnavailable
        || ex.StatusCode == HttpStatusCode.InternalServerError
        || (ex.StatusCode == HttpStatusCode.Gone
            && ex.SubStatusCode == (int)SubStatusCodes.LeaseNotFound)
        || (ex.StatusCode == HttpStatusCode.Forbidden
            && ex.SubStatusCode == (int)SubStatusCodes.DatabaseAccountNotFound)))

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

throw;
}
}
catch (CosmosException cosmosException)
Member

🟡 Recommendation — Missing test coverage for the most critical change in this PR

This new catch (CosmosException) block is the key change enabling the entire operation-level cross-region retry for metadata failures. Without it, CosmosException escapes to SendAsync's outer catch at line 38 and is silently converted to a ResponseMessage without consulting ClientRetryPolicy.

However, there is no test in RetryHandlerTests.cs (or elsewhere) that verifies this path works. The existing RetryHandlerDoesNotRetryOnException test throws a generic Exception, and RetryHandlerHttpClientExceptionRefreshesLocations covers HttpRequestException — but neither exercises the CosmosException catch.

Suggestion: Add a test in RetryHandlerTests.cs following the RetryHandlerHttpClientExceptionRefreshesLocations pattern:

  1. Have the inner handler throw a CosmosException (503)
  2. Configure the mock retry policy to return RetryAfter on the first ShouldRetryAsync(Exception) call
  3. Verify the retry policy is consulted and the handler retries the operation
  4. Add a complementary test verifying that when the policy returns NoRetry, the CosmosException re-throws (not silently converted)

This is especially important because NamedCacheRetryHandler also extends AbstractRetryHandler — its InvalidPartitionExceptionRetryPolicy will now see CosmosException too, and should be verified to handle it safely (it returns NoRetry, causing re-throw, which is correct).

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

mock.Setup(gem => gem.ResolveServiceEndpoint(It.IsAny<DocumentServiceRequest>()))
.Returns(primaryEndpoint);
// Default PreferredLocationCount = 0 → maxRetries = Max(1,0) = 1
mock.Setup(gem => gem.PreferredLocationCount).Returns(0);
Member

🟡 Recommendation — All tests use PreferredLocationCount = 0, missing multi-region cycling coverage

Every test in this file uses CreateMockEndpointManager with PreferredLocationCount = 0, which yields maxUnavailableEndpointRetryCount = Max(1, 0) = 1. This means only one retry is ever exercised before exhaustion.

In production, most accounts have multiple preferred regions (e.g., 3). The core value proposition of this PR — cycling through regions on metadata failure — is never tested with more than a single retry. If IncrementRetryIndexOnUnavailableEndpointForMetadataRead() had an off-by-one error at higher retry counts, or if OnBeforeSendRequest failed to properly resolve different endpoints on successive retries, no test would catch it.

Suggestion: Add a test with PreferredLocationCount = 3 that:

  1. Returns different endpoints from ResolveServiceEndpoint on each call (region1, region2, region3)
  2. Simulates 3 sequential 503 failures (calling OnBeforeSendRequest + ShouldRetryAsync in a loop)
  3. Asserts each returns ShouldRetry = true with MarkEndpointUnavailableForRead called for the correct endpoint
  4. Asserts the 4th attempt returns NoRetry

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

return Task.FromResult(this.HandleRegionalFailure());
}

if (exception is OperationCanceledException && !cancellationToken.IsCancellationRequested)
Member

🟢 Suggestion — Novel pattern: treating non-user OperationCanceledException as a regional failure

This is a new behavioral pattern not present in any sibling retry policy. ClientRetryPolicy.ShouldRetryAsync(Exception) does not treat OperationCanceledException as a regional failure requiring endpoint marking and cross-region retry — it only marks per-partition endpoints and falls through to the throttling policy (which returns NoRetry).

The rationale here (transport timeout = region issue) is reasonable for metadata requests, since a timeout on a metadata read is a strong signal of regional unavailability. But this creates a behavioral divergence between the metadata-level and operation-level policies interpreting the same exception type.

Questions to confirm:

  1. Is this intentional? If so, consider adding a comment explaining why metadata-level OCE handling differs from ClientRetryPolicy.
  2. Should ClientRetryPolicy be aligned in a follow-up PR to also treat non-user OCE as a regional failure for metadata requests?

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

}

[TestMethod]
public async Task InitializeAsync_WithCosmosException_ShouldThrowAfterMaxRetries()
Member

🟢 Suggestion — Missing test for HttpRequestException retry exhaustion in BootstrapperCore

The CosmosException path has an exhaustion test (InitializeAsync_WithCosmosException_ShouldThrowAfterMaxRetries) that verifies the exception propagates after MaxInitializationRetries + 1 attempts. The HttpRequestException path has no equivalent exhaustion test — only a retry-and-succeed test.

While the two catch blocks are structurally symmetric, they are separate code paths with independent when guards. A bug in the HttpRequestException catch's guard or retry logic would not be caught by the existing CosmosException exhaustion test.

Suggestion: Add InitializeAsync_WithHttpRequestException_ShouldThrowAfterMaxRetries mirroring the CosmosException exhaustion test.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

/// regions left to try; <see cref="ShouldRetryResult"/> with <c>ShouldRetry = false</c> otherwise,
/// allowing the exception to propagate to the operation-level retry policy.
/// </returns>
private ShouldRetryResult HandleRegionalFailure()
Member

💬 Observation — Clean consolidation with a subtle behavioral change

HandleRegionalFailure() cleanly consolidates the "mark + increment" pattern and replaces 4 separate code blocks. The structural improvement is welcome.

One subtle behavioral change worth noting: previously, when IncrementRetryIndexOnUnavailableEndpointForMetadataRead() returned false (retries exhausted), the code fell through to this.throttlingRetryPolicy.ShouldRetryAsync(exception, cancellationToken). Now it returns ShouldRetryResult.NoRetry() directly.

This is functionally equivalent today because ResourceThrottleRetryPolicy only retries 429s and returns NoRetry for the status codes handled here (503, 500, 410/LeaseNotFound, 403/DatabaseAccountNotFound). However, it changes the implicit contract: the throttling policy is no longer consulted after region exhaustion. If ResourceThrottleRetryPolicy ever gains broader retry logic in the future, this path would be missed.

Not blocking — just calling it out for awareness.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

}

[TestMethod]
public async Task InitializeAsync_WithCosmosException_ShouldThrowAfterMaxRetries()
Member

🟢 Suggestion — Add test for non-retryable CosmosException to document intent

The catch (CosmosException) in BootstrapperCore catches all status codes (see separate comment). Whether this is intentional or a bug, a test explicitly exercising a non-regional CosmosException (e.g., 404 NotFound) would document the expected behavior:

  • If retrying all exceptions is intended: the test asserts 4 attempts and adds a comment explaining why
  • If only regional errors should retry: the test asserts the exception propagates immediately, and the when guard gets a status code filter

Either way, a test closes the ambiguity.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

/// the endpoint as unavailable in the <see cref="IGlobalEndpointManager"/> when
/// a regional failure is detected.
/// </summary>
private Uri locationEndpoint;
Member

💬 Observation — Design: Marking endpoint unavailable affects ALL reads, not just metadata

MarkEndpointUnavailableForRead() signals the LocationCache to deprioritize this endpoint for all subsequent read operations (queries, point reads, change feed), not just metadata requests. The endpoint moves to the end of the preferred list and self-heals after the 5-minute TTL (DefaultUnavailableLocationsExpirationTimeInSeconds = 300).

This is the correct trade-off: if a region is failing for metadata reads (partition key range lookups), it's very likely failing for data reads too. The impact is soft (deprioritized, not blocked) and self-healing. This aligns with how ClientRetryPolicy.ShouldRetryOnEndpointFailureAsync already works.

Also confirmed: double-marking (from both this policy and ClientRetryPolicy when the exception propagates) is benign — LocationCache.MarkEndpointUnavailable uses ConcurrentDictionary.AddOrUpdate, which is idempotent (just refreshes the TTL timestamp).

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

MaxInitializationRetries,
this.sleepTime);

await Task.Delay(this.sleepTime).ConfigureAwait(false);
Member

🟢 Suggestion — Task.Delay without CancellationToken

The review rules require verifying cancellation token propagation on all async methods. Both Task.Delay(this.sleepTime) calls in the new catch blocks don't accept a cancellation token — if the caller wants to cancel during the 15-second retry delay, they can't.

The InitializeAsync() method itself doesn't take a CancellationToken parameter (pre-existing design), so there's no token to pass here. This is a pre-existing limitation of the Bootstrapper interface, not introduced by this PR.

Consider adding CancellationToken support to InitializeAsync() in a follow-up if the team wants cancellable bootstrap initialization.
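
If that follow-up happens, the change would be small. A hypothetical sketch (the token parameter does not exist on the current Bootstrapper interface):

// Hypothetical follow-up: thread a token through InitializeAsync so the retry delays are cancellable.
public async Task InitializeAsync(CancellationToken cancellationToken = default)
{
    // ... existing initialization and catch blocks ...
    await Task.Delay(this.sleepTime, cancellationToken).ConfigureAwait(false);
    // ...
}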

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

@xinlian12
Member

Review complete (50:07)

Posted 9 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage


Development

Successfully merging this pull request may close these issues.

Metadata request (pkranges) 503 failure does not trigger cross-region retry at the operation level
